ggplot2 for longitudinal dataRecall from last week the R packages dplyr, for data processing and management and ggplot2, which produces high-quality graphics. Let’s bring in these two packages as well as the data from last week (study on effect of diet on Australian cows’ milk protein):
library(dplyr)
library(ggplot2)
# change directories to where this file sits on your own computer!
milk_long <- read.table("~/Documents/!PhD-UW/CSSS-594/ALDA/milk_long.csv",
header=TRUE,
sep=",")
head(milk_long)## Diet Cow Week Protein
## 1 Barley 1 1 3.63
## 2 Barley 1 2 3.57
## 3 Barley 1 3 3.47
## 4 Barley 1 4 3.65
## 5 Barley 1 5 3.89
## 6 Barley 1 6 3.73
Let’s make a ggplot of all cows’ protein levels by week:
milk_graph <- ggplot(data=milk_long,
aes(x=Week, y=Protein, group=Cow, color=Diet)) +
geom_point() +
geom_line() +
ggtitle("Protein levels in milk by week for each cow") +
theme_bw() +
scale_color_manual(values=c("mediumorchid4", "darkgoldenrod1", "deepskyblue"))
milk_graph# How can we improve this?In this example, we’re using:
aes):
x: \(x\)-axis, our time variable (here, Week)y: \(y\)-axis, our outcome variable (here, Protein)group: our unit variable (here, Cow)color: variable we want to use to segment by color, usually should be coded as a factor variable (here, Diet)+) with a set of () taking arguments:
ggplot: base layer containing information about the data and overall aesthetics with aes()geom_point: scatterplot points at \(x\) and \(y\) valuesgeom_line: line connecting values within each groupggtitle: title for the plot.Additional aesthetics we can use are shape (symbols used for points, usually should be coded as a factor variable), linetype, size (dot size/line thickness), alpha (transparency level, 0 for fully transparent and 1 for fully opaque), and fill (color for areas like bar charts).
When we want to change the appearance of parts of the graph in general that aren’t related to specific variables (e.g. use large sized symbols for the geom_point layer), we pass those as non-aesthetic options to the layers; that is, in the specific layer and not enclosed in an aes() group in the ggplot() main layer. For more information on modifying colors, legends, etc., see http://www.cookbook-r.com/Graphs/index.html. For specific R color names, see http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf. Shapes and linetypes can be modified in a similar way, with more information at http://www.cookbook-r.com/Graphs/Shapes_and_line_types.
Recall we can improve graph legibility by using facet_grid to segment by diet:
# We decided that the vertical view was more useful
milk_graph +
facet_grid(Diet ~ .)If we had two variables to facet by (such as diet and species), the one before the ~ denotes how we split down facet rows, and the one after the ~ denotes how we split across facet columns.
We can even facet down to the individual cow to get a panel for each of them, with a facet_wrap( ~ Cow) command. But that’s a lot of cows - 79 to be exact. That’ll be hard to see and slow to render. We can sample some at random and use the %in% operator in R to check if IDs are in the sampled IDs list:
# setting a seed in R makes your "random" number generation reproducible
set.seed(10715)
# sample 20 cow IDs without replacement from the set of cow IDs
sampled_cows <- sample(unique(milk_long$Cow), size=20, replace=FALSE)
sampled_cows## [1] 17 18 55 53 65 16 52 46 49 67 26 68 30 38 31 1 59 37 6 63
# apply filtering to the data to keep the cows appearing in sampled_cows
# note use of the %in% operator
ggplot(data=milk_long %>%
filter(Cow %in% sampled_cows),
aes(x=Week, y=Protein, group=Cow, color=Diet)) +
geom_line(alpha=0.5) +
ggtitle("Protein levels in milk by week for sampled cow") +
theme_bw() +
scale_color_manual(values=c("mediumorchid4", "darkgoldenrod1", "deepskyblue")) +
# 5 row, 4 column layout
facet_wrap( ~ Cow, ncol=4)With facets, the axes are automatically set to the same scale, unless you add the option scales="free" to the facet_grid or facet_wrap layer. We can also set the \(y\)-axis manually by adding a layer + ylim(0,5) to force the range between 0 and 5, or the \(x\)-axis manually + xlim(0,20) to force the range between 0 and 20. Sometimes we will want to set the axes manually when the automatic choices exaggerate the amount of variation in trajectories.
For more on facets, see http://www.cookbook-r.com/Graphs/Facets_(ggplot2).
Using dplyr, last week we used group_by to do summaries at different levels of the data. We can use other non-summary functions at the level of our groups by using the do function. We can fit a linear regression for each cow with protein as the outcome and week as the explanatory variable (i.e., fit a separate linear time trend for the protein in each cow’s milk). Use the following code to get a data frame with one row per cow and one column each for its linear model’s intercept, slope, and \(R^2\) value:
milk_regression <- milk_long %>%
group_by(Diet, Cow) %>%
# use 'do' to run a linear regression with the grouped observations
# and store the summary object in a variable called indiv_model
do(indiv_model = summary(lm(Protein ~ Week, data=.))) %>%
# get the intercept, slope (coefficient of Week), and R^2 from the model objects
mutate(intercept = indiv_model[["coefficients"]]["(Intercept)","Estimate"],
slope = indiv_model[["coefficients"]]["Week","Estimate"],
r2 = indiv_model[["r.squared"]]) %>%
# drop the indiv_model object for each subject since we're done with it
select(-indiv_model)
head(milk_regression)## # A tibble: 6 x 5
## Diet Cow intercept slope r2
## <fctr> <int> <dbl> <dbl> <dbl>
## 1 Barley 1 3.478596 0.0418245614 0.810377937
## 2 Barley 2 3.250877 0.0009649123 0.001565978
## 3 Barley 3 3.745824 -0.0729670330 0.791993838
## 4 Barley 4 3.105752 0.0089267286 0.040756178
## 5 Barley 5 3.872281 -0.0202280702 0.177253115
## 6 Barley 6 4.009890 -0.0278901099 0.216735206
How did I know to focus on the lm summary object for this information? Let’s drill a little deeper into lm and its related objects. Start by making a simple linear model for one of the cows, say the fifth one:
# look at just one cow
this_cow <- milk_long %>%
filter(Cow==5)
head(this_cow)## Diet Cow Week Protein
## 1 Barley 5 1 4.34
## 2 Barley 5 2 3.76
## 3 Barley 5 3 3.68
## 4 Barley 5 4 3.51
## 5 Barley 5 5 3.45
## 6 Barley 5 6 3.53
# fit a regression
this_cow_line <- lm(Protein ~ Week, data=this_cow)
# What is this_cow_line?
class(this_cow_line)## [1] "lm"
# What's inside that object?
names(this_cow_line)## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
# I see coefficients but not r squared. How about in the lm summary?
class(summary(this_cow_line))## [1] "summary.lm"
names(summary(this_cow_line))## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
Since we see that both values we’re searching for are in the summary.lm object, not the lm object, use that (i.e., make the do command above be a summary(lm()) command).
Recall that regression models are defined by their slope(s) and intercept. How are good ways to display these across the many cow-specific linear models in milk_regression.
We’ve seen stem-and-leaf plots of the intercepts and slopes for the level-1 regressions. There is a function for this in base R called stem:
stem(milk_regression$intercept, scale=2)##
## The decimal point is 1 digit(s) to the left of the |
##
## 29 | 679
## 30 | 134
## 31 | 115789
## 32 | 134558
## 33 | 01236778
## 34 | 0012366899
## 35 | 002233456888899
## 36 | 2557
## 37 | 056778889
## 38 | 003788
## 39 | 2237
## 40 | 19
## 41 | 4
## 42 | 03
We can make histograms or density plots using ggplot2 to examine the distributions of intercepts and slopes:
# histograms
# note we only have x aesthetic, there is no "y" as this is created by geom_histogram
ggplot(milk_regression, aes(x=intercept)) +
geom_histogram(fill="magenta", binwidth=0.1) +
ggtitle("Intercepts from linear regressions fit to each cow")# density estimates
ggplot(milk_regression, aes(x=intercept)) +
geom_density(fill="magenta") +
ggtitle("Intercepts from linear regressions fit to each cow")With histograms, you may want to use warning=FALSE, message=FALSE in your chunk options in your .Rmd file since you will often get warnings and messages if you don’t set binwidth and let ggplot2 choose it for you. For density plots, you can adjust the bandwidth to make the plot more or less smooth using the adjust option in geom_density, which defaults to 1 (so a smaller number yields a smoother density curver and a large number makes a more jagged curve).
See http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2) for more on ggplot2 histograms and density plots.
Try this out for the linear regression slopes. What patterns do you see? Try changing the adjust argument in geom_density - which value of adjust looks reasonable?
We can look at the relationship between fitted intercepts and slopes with a scatterplot:
ggplot(data=milk_regression, aes(x=intercept, y=slope)) +
# We could also break these out by diet using color or facets for another view
geom_point() +
ggtitle("Slope vs. intercept for level-1 regressions")With ggplot2, you don’t need to manually fit a regression for each subject to plot them across groups. The package can do it all for you with a geom_line layer with the argument stat="smooth" that applies to each aesthetic group separately:
# look at the level 1 regressions faceted out for the sample of cows
ggplot(data=milk_long %>%
# just the sampled cows using %in%
filter(Cow %in% sampled_cows),
aes(x=Week, y=Protein, group=Cow, color=Diet)) +
geom_point() +
# adds fitted lines for each group to the plot using lm
geom_line(stat="smooth", method="lm") +
ggtitle("Protein levels in milk by week for sampled cow with linear fit") +
theme_bw() +
scale_color_manual(values=c("mediumorchid4", "darkgoldenrod1", "deepskyblue")) +
facet_wrap( ~ Cow, ncol=4)For a non-parametric fit, try replacing method="lm" with method="loess" in geom_line. How does the linear fit compare with the non-linear non-parametric fit?
We can also look at all cows’ fitted regression lines on one plot by dropping the facets and the scatterplot layer and adding some transparency:
ggplot(data=milk_long,
aes(x=Week, y=Protein, group=Cow, color=Diet)) +
geom_line(stat="smooth", method="lm", alpha=0.75) +
ggtitle("Linear fit for each cow's trajectory") +
theme_bw() +
scale_color_manual(values=c("mediumorchid4", "darkgoldenrod1", "deepskyblue"))We can even specify using our own formula if we want to try, say, a quadratic trajectory for each cow:
ggplot(data=milk_long,
aes(x=Week, y=Protein, group=Cow, color=Diet)) +
# using a formula quadratic in time: y = a + b x + c x^2 + error
# the I(x^2) tells R to compute x^2 and handle this term separately
# you use standard R regression syntax in formula
geom_line(stat="smooth", method="lm",
formula= y ~ x + I(x^2), alpha=0.75) +
ggtitle("Quadratic fit for each cow's trajectory") +
theme_bw() +
scale_color_manual(values=c("mediumorchid4", "darkgoldenrod1", "deepskyblue"))We can add average trajectories with an additional geom_line(stat="smooth") layer, but this time overriding the aesthetics in the layer so that the group is all observations (aes(group="1")) or a subgroup of interest (aes(group=Diet)).
ggplot(data=milk_long,
aes(x=Week, y=Protein, group=Cow, color=Diet)) +
geom_line(stat="smooth", method="lm", alpha=0.5) +
# add a layer for average linear models by group, in thicker lines (the size argument)
# keeping color=Diet in the aes so it uses the same colors as in the individual lines, and adding group=Diet so the models are grouped by diet
geom_line(stat="smooth", aes(group=Diet, color=Diet), method="lm", size=3, alpha=0.75) +
# add a layer for the overall mean, in thick (size argument) dashed (linetype argument) line
# use group="1" in the aes for an overall summary
geom_line(stat="smooth", aes(group="1"), color="black", method="lm", size=2, linetype="dashed", alpha=0.75) +
ggtitle("Linear fit for each cow's trajectory") +
theme_classic() +
scale_color_manual(values=c("mediumorchid4", "darkgoldenrod1", "deepskyblue"))There is a TON more you can do with ggplot2! Get inspiration at http://www.r-graph-gallery.com/portfolio/ggplot2-package/.
tidyr for reshaping datatidyr is a companion package to dplyr that is designed to reshape your data and uses the same piping %>% syntax. You can also use the reshape function in base R or functions in other packages to accomplish the same goal.
As mentioned in lecture 2, usually we want our data in “long” format, which is better at accomodating irregular observations and necessary for plotting.
# install.packages("tidyr")
library(tidyr)
# wide version of milk data
milk_wide <- read.table("~/Documents/!PhD-UW/CSSS-594/ALDA/milk_wide.csv",
header=TRUE,
sep=",")
head(milk_wide) # Note the NAs!## Diet Cow Wk1 Wk2 Wk3 Wk4 Wk5 Wk6 Wk7 Wk8 Wk9 Wk10 Wk11 Wk12
## 1 Barley 1 3.63 3.57 3.47 3.65 3.89 3.73 3.77 3.90 3.78 3.82 3.83 3.71
## 2 Barley 2 3.24 3.25 3.29 3.09 3.38 3.33 3.00 3.16 3.34 3.32 3.31 3.27
## 3 Barley 3 3.98 3.60 3.43 3.30 3.29 3.25 2.93 3.20 3.27 3.22 2.93 2.92
## 4 Barley 4 3.66 3.50 3.05 2.90 2.72 3.11 3.05 2.80 3.20 3.18 3.14 3.18
## 5 Barley 5 4.34 3.76 3.68 3.51 3.45 3.53 3.60 3.77 3.90 3.87 3.61 3.85
## 6 Barley 6 4.36 3.71 3.42 3.95 4.06 3.73 3.92 3.99 3.70 3.88 3.71 3.62
## Wk13 Wk14 Wk15 Wk16 Wk17 Wk18 Wk19
## 1 4.10 4.02 4.13 4.08 4.22 4.44 4.30
## 2 3.41 3.45 3.12 3.42 3.40 3.17 3.00
## 3 2.82 2.64 NA NA NA NA NA
## 4 3.24 3.37 3.30 3.40 3.35 3.28 NA
## 5 3.94 3.87 3.60 3.06 3.47 3.50 3.42
## 6 3.74 3.42 NA NA NA NA NA
To get data into long format (= person-period level) from wide format (= person level), we use the gather function in tidyr:
milk_wide_to_long <- milk_wide %>%
# gather takes these arguments:
# key=name of the longitudinal variable to be created
# value=name of the variable measured at each occasion
# the rest are names of what you want to go from individual columns to values under the longitudinal variable
# note that any variable not mentioned will be unchanged
gather(key=Time, value=Protein, Wk1:Wk19) %>%
# sort it by Cow and Time using arrange
arrange(Cow, Time)
head(milk_wide_to_long)## Diet Cow Time Protein
## 1 Barley 1 Wk1 3.63
## 2 Barley 1 Wk10 3.82
## 3 Barley 1 Wk11 3.83
## 4 Barley 1 Wk12 3.71
## 5 Barley 1 Wk13 4.10
## 6 Barley 1 Wk14 4.02
This worked pretty well, but I’d rather have numeric Time values instead of “Wk1”, “Wk2”, etc. so that I can plot things in the right order later. We can use the separate command in tidyr to fix this and split at the third position to take off the “Wk”:
milk_wide_to_long <- milk_wide %>%
gather(key=Time, value=Protein, Wk1:Wk19) %>%
# separate takes these arguments:
# col=column we want to split up into two
# into=character strings for the names of new columns we want to create
# sep=the character at which we want to split the col variable
# (could also split based on punctuation - see ?separate)
# convert=TRUE means take anything that looks like a number and make it a number instead of character
separate(col=Time, into=c("prefix","Week"), sep=2, convert=TRUE) %>%
# now drop prefix since we don't need it (note the minus sign for dropping a variable)
select(-prefix) %>%
# sort by Cow and then the cleaned Week number
arrange(Cow, Week)
head(milk_wide_to_long)## Diet Cow Week Protein
## 1 Barley 1 1 3.63
## 2 Barley 1 2 3.57
## 3 Barley 1 3 3.47
## 4 Barley 1 4 3.65
## 5 Barley 1 5 3.89
## 6 Barley 1 6 3.73
Finally, there are some missing values (NA’s) because gather assumes fixed number of measurement ocassions but some cows weren’t observed every week. We can drop these in long format by filtering out those rows, which we can identify using the is.na command (returns TRUE when NA) and using ! to mean NOT these rows:
milk_wide_to_long <- milk_wide_to_long %>%
filter(!is.na(Protein))Occasionally “wide” format is useful, such as in computing correlations (e.g. correlation between cows’ protein values at Week 1 and Week 2). If we didn’t already have a wide version of the file and wanted it, we use the spread function in tidyr, which is the inverse of the gather function:
milk_long_to_wide <- milk_long %>%
# make a column called week_name that has "Wk"+number, so we get variable names that aren't plain numbers
# paste0 is a function for concatenating text without adding spaces
mutate(week_name=paste0("Wk",Week)) %>%
# we can get rid of Week now because we have the new column
select(-Week) %>%
# spread takes these arguments:
# key=the variable to use for column names for the new columns in wide format,
# value=the variable to stick in as values for those columns
spread(key=week_name, value=Protein) %>%
# the new columns are in alphabetical order, not numeric order
# (Wk1, Wk10, Wk11, ..., Wk19, Wk2, Wk3, etc.)
# so use select to reorder them in the order we list
select(Diet, Cow, Wk1, Wk2:Wk9, Wk10:Wk19)
head(milk_long_to_wide)## Diet Cow Wk1 Wk2 Wk3 Wk4 Wk5 Wk6 Wk7 Wk8 Wk9 Wk10 Wk11 Wk12
## 1 Barley 1 3.63 3.57 3.47 3.65 3.89 3.73 3.77 3.90 3.78 3.82 3.83 3.71
## 2 Barley 2 3.24 3.25 3.29 3.09 3.38 3.33 3.00 3.16 3.34 3.32 3.31 3.27
## 3 Barley 3 3.98 3.60 3.43 3.30 3.29 3.25 2.93 3.20 3.27 3.22 2.93 2.92
## 4 Barley 4 3.66 3.50 3.05 2.90 2.72 3.11 3.05 2.80 3.20 3.18 3.14 3.18
## 5 Barley 5 4.34 3.76 3.68 3.51 3.45 3.53 3.60 3.77 3.90 3.87 3.61 3.85
## 6 Barley 6 4.36 3.71 3.42 3.95 4.06 3.73 3.92 3.99 3.70 3.88 3.71 3.62
## Wk13 Wk14 Wk15 Wk16 Wk17 Wk18 Wk19
## 1 4.10 4.02 4.13 4.08 4.22 4.44 4.30
## 2 3.41 3.45 3.12 3.42 3.40 3.17 3.00
## 3 2.82 2.64 NA NA NA NA NA
## 4 3.24 3.37 3.30 3.40 3.35 3.28 NA
## 5 3.94 3.87 3.60 3.06 3.47 3.50 3.42
## 6 3.74 3.42 NA NA NA NA NA
We need to be wary of NAs introduced into the wide data frame. Some functions have default NA handling; others require us to specify what to do with them. For example:
# What's the minimum value of Wk19 for all cows?
min(milk_long_to_wide$Wk19) # Won't work## [1] NA
min(milk_long_to_wide$Wk19, na.rm=T) # Tell min() how to handle NAs## [1] 2.66
We can use the “wide” data to look at the correlation between protein values cows had at one week and any another week in the study.
This code gets us the sample correlation matrix among protein levels in the first 6 weeks:
milk_correlation <- milk_wide %>%
# just look at the first 6 weeks
select(Wk1:Wk6) %>%
# there are some missing values, let's use only observations with all 6 present ("complete.obs"). (Use this only when you know you don't need all observations!)
cor(x=., use="complete.obs")
# correlation matrix formatted nicely with pander and a caption
library(pander)
# putting in some options to keep the decimal points under control
panderOptions('round', 2)
panderOptions('keep.trailing.zeros', TRUE)
pander(milk_correlation,
# Don't forget to mention we're only using the complete observations!
caption="Correlation among milk protein levels in first 6 weeks, complete observations only")| Wk1 | Wk2 | Wk3 | Wk4 | Wk5 | Wk6 | |
|---|---|---|---|---|---|---|
| Wk1 | 1.00 | 0.46 | 0.47 | 0.32 | 0.34 | 0.28 |
| Wk2 | 0.46 | 1.00 | 0.51 | 0.52 | 0.51 | 0.38 |
| Wk3 | 0.47 | 0.51 | 1.00 | 0.62 | 0.59 | 0.39 |
| Wk4 | 0.32 | 0.52 | 0.62 | 1.00 | 0.69 | 0.57 |
| Wk5 | 0.34 | 0.51 | 0.59 | 0.69 | 1.00 | 0.55 |
| Wk6 | 0.28 | 0.38 | 0.39 | 0.57 | 0.55 | 1.00 |